9.1 Executive Summary


The vision group at Rochester spent the year investigating several aspects of parallel and
real-time computer vision, with the overall goal of implementing a set of basic sensory-motor
behaviors that could serve as a foundation for more sophisticated abilities, and of integrating
these primary behaviors into multi-modal systems. The emphasis was on behaviors that are
relevant to, and can be implemented to work robustly in, a broad range of real-world
environments, since these are most likely to be useful as fundamental skills.

This work reflects the position that the way to make progress in computer vision is to
investigate the sensory-motor coupling that is necessary to carry out specific tasks. Once a basic
behavioral repertoire is obtained, its components can be combined and modified to produce
systems of increasing sophistication. This approach depends critically on the identification of
appropriate foundational abilities. Since organisms, including humans, presumably evolved their
present visual sophistication through a process akin to the proposed approach, one obvious source
of inspiration is the primitive visual behaviors of animals. We have concentrated on two such
areas: gaze control and visual navigation.

This year’s active vision work centered on commissioning the Rochester Robot, which
consists of a 3-degree-of-freedom, two-eyed robot head connected to a Puma 761 arm. The
camera-robot system is connected to a Datacube image processor, among other computational
hardware, which is fast enough to allow real-time visual motion control. The robot has proved to
be a fruitful testbed, and a number of behaviors have been implemented, including a
vestibulo-ocular reflex (VOR), vergence, and target tracking.

Research also continued into various theoretical aspects of computer vision, including
parallel evidence combination, parallel object recognition, principal view analysis, and adaptive
Kalman filtering.

Individual activities were, briefly, as follows. Paul Chou completed his PhD and is now at
the IBM T. J. Watson Research Center. Paul Cooper continued his work on parallel object
recognition and will finish his thesis in 1989. The Rochester Robot was commissioned with much
help from Dave Tilley and Tim Becker. Ray Rimey and Rob Potter, with help from Tilley and Tom
Olson, implemented several gaze-control mechanisms on the robot. Randal Nelson joined the
faculty and the vision group this fall, coming from the University of Maryland, where he worked
on problems in visual navigation. He expects to continue with related work here. Chris Brown’s
work in 1988 divides into three phases that correspond with the calendar. From January through
June he concentrated on reconstruction and segmentation algorithms implemented with Markov
Random Fields and data fusion, and to a smaller extent on principal view (or aspect graph)
calculations for non-convex polyhedral scenes; this work culminated in papers and the PhD thesis
by Paul Chou and the paper by Nancy Watts. From June through August he concentrated on
robotics and real-time vision, especially commissioning the Rochester Robot and implementing
real-time vision demonstrations; this work is reported in at least three technical reports, one of
which has been submitted for publication. From September through December he was at the
University of Oxford, where he implemented several versions of the Kalman filter for tracking and
parameter estimation, produced a technical report on the subject, and implemented a simulator
for the kinematic and (to some degree) control behavior of the Rochester Robot. He has also been
learning about adaptive and optimal control theory.
9.2 Reconstruction and Segmentation in Parallel: Data Fusion


Integrating disparate sources of information has been recognized as one of the keys to the
success of general purpose vision systems. In the work of P. Chou and C. Brown [Chou87,
Chou88a, Chou88b], data fusion is used to accomplish reliable segmentation and reconstruction
in parallel. The computation is formulated as a labeling problem. Local visual observations for
each image entity are reported as label likelihoods. They are combined consistently and
coherently in hierarchically structured label trees with a new, computationally simple procedure.
The pooled label likelihoods are fused with the a priori spatial knowledge encoded as Markov
Random Fields (MRFs). The a posteriori distribution of the labelings is thus derived in a
Bayesian formalism. A new inference method, called Highest Confidence First (HCF) estimation, is
used to infer a unique labeling from the a posteriori distribution. HCF has the computational
advantages of efficiency and predictable running time. It degrades gracefully, and follows a
least-commitment strategy. Its results are consistent with observable evidence and a priori
knowledge, and (not least) it out-performs other known methods. The comparative performance
of HCF and other methods has been empirically tested on synthetic and real scenes, using both
intensity and sparse depth data for sensor fusion experiments.

This  work  addressed  four  fundamental  open  questions,  and  provided  an  answer  for  each. 

1. There was no formulation of a consistent and coherent scheme for integrating early visual
observations. The contribution here was a label tree that captures increasing levels of
abstraction, and a computationally simple procedure for keeping it updated as information
arrives. Rather than update a "marginal belief" as evidence accumulates, which requires
propagation of local belief to other sites, we maintain joint probability distributions at the
sites by decoupling the notions of external evidence and prior knowledge.

2. This work successfully incorporates a priori spatial knowledge, encoded as potential energy
functions that determine the MRF distribution. The results of the algorithm, i.e. the a
posteriori distributions, are derived by combining the pooled external observations and the a
priori information. MRFs are useful for image modeling because the prior knowledge about
spatial dependencies among the entities can often be adequately modeled by considering
neighborhoods that are small. Image entities like pixels are regularly structured and
isotropic, which makes the a priori distributions easy to formulate. The results are much more
resistant to modeling error than are those of other methods.

3. Many algorithms for minimizing energy have been used to assign labels via MRFs: all are
either slow or unreliable. The contribution of this work was the HCF algorithm, which is a
quasi-deterministic steepest-descent search in an augmented label space, which includes the
label "uncommitted". The HCF algorithm assigns a label to that image entity which will,
when labeled, decrease the global energy the most. Thus the algorithm can be implemented
with a priority queue data structure. While this is efficient on a sequential computer,
techniques for parallel implementations are of interest.

4. Reconstruction algorithms (e.g. "shape from X") presume the job of segmentation is
completed, and segmentation algorithms would like to use information like that provided by
reconstruction. This work unifies the reconstruction and segmentation processes, using as
an experimental data set intensity and sparse depth. This work requires extending HCF to
deal with discrete and continuous labels (i.e. edge and depth information), and requires the
use of coupled MRFs.
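
The priority-queue formulation mentioned in item 3 can be illustrated with a minimal sketch. This is not the implementation from [Chou88a, Chou88b]: the function name `hcf_label`, the Potts-style pairwise penalty `beta`, and the confidence measure for uncommitted sites (the gap between the best and runner-up local energies) are all assumptions made for illustration.

```python
import heapq

def hcf_label(unary, neighbors, beta):
    """Label the sites of a small MRF in highest-confidence-first order.

    unary[i][l]  -- cost of giving label l to site i (e.g. a negative log-likelihood)
    neighbors[i] -- indices of the sites adjacent to site i (assumed symmetric)
    beta         -- penalty for each pair of committed neighbors that disagree
    """
    n = len(unary)
    labels = [None] * n        # None plays the role of the "uncommitted" label
    version = [0] * n          # lets us discard stale heap entries

    def local_energy(i, label):
        clash = sum(1 for j in neighbors[i]
                    if labels[j] is not None and labels[j] != label)
        return unary[i][label] + beta * clash

    def stability(i):
        """Return (stability, preferred label); lower stability is more urgent."""
        ranked = sorted((local_energy(i, l), l) for l in range(len(unary[i])))
        if labels[i] is None:
            # Assumed confidence: gap between the best and runner-up labels.
            return ranked[0][0] - ranked[1][0], ranked[0][1]
        current = local_energy(i, labels[i])
        energy, label = next(e for e in ranked if e[1] != labels[i])
        return energy - current, label   # negative if switching lowers the energy

    heap = [(stability(i)[0], i, 0) for i in range(n)]
    heapq.heapify(heap)
    while heap:
        s, i, v = heapq.heappop(heap)
        if v != version[i]:
            continue                     # entry predates a neighboring change
        if labels[i] is not None and s >= 0:
            continue                     # committed and locally stable
        labels[i] = stability(i)[1]
        for j in [i] + list(neighbors[i]):
            version[j] += 1
            heapq.heappush(heap, (stability(j)[0], j, version[j]))
    return labels

# A five-site chain: the middle site weakly prefers label 1, its neighbors
# strongly prefer label 0.
unary = [[0, 4], [0, 4], [1.2, 1.0], [0, 4], [0, 4]]
chain = [[1], [0, 2], [1, 3], [2, 4], [3]]
result = hcf_label(unary, chain, beta=1.0)
```

Because the confident sites commit first, the weak middle site is decided only after its neighbors have committed, and the spatial prior then outweighs its marginal preference for the other label.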

Results from this work are gratifying and have been reported often throughout 1988 in
various locations. Included here are results for real scenes for finding boundaries (fig. 9.2.1)
and for information fusion (fig. 9.2.2).
Figure 9.2.1: Boundary Detection Experiment

Data, boundary results, and resulting energy measures for labeling (the lower the better). (a)
Natural 100x124 image of four plastic blocks. (b) Thinned and thresholded output of Kirsch
operators. (c) TLR configuration: energy 4785. (d) Stochastic MAP estimate: energy -349. (e)
Stochastic MPM estimate: energy -503. (f) ICM (scan-line visiting order) estimate: energy -503.
(g) ICM (random visiting order) estimate: energy -513. (h) HCF result: energy -629.
Figure 9.2.2: Experiments with Stereo Disparity Data
a) 200 by 200 disparity image. b) Scene with projected structured light. c) Three disparity
images. d) Perspective view of the combined disparity image. e) Locations of the disparity
measurements overlaid with the TLR estimate of the intensity discontinuities.
Figure 9.2.2 (Continued)

f) Reconstructed disparity map with p=0.95. g) Disparity discontinuities. h) Reconstructed
disparity map with p=0.9. i) Disparity discontinuities.
9.3 Principal Views


During 1988, Nancy Watts pursued research leading to progress in the difficult problem of
characterizing the different views presented to an observer (in either perspective or orthographic
projection) by a non-convex polyhedron. This work was a continuation of her earlier work,
which produced an algorithm for computing all views of a convex polyhedron. This research is
still in progress, but has resulted in a paper presented at ICPR [Watt88].

The usual approach to this problem is from the point of view of abstract computational
geometry, in which existence proofs and non-constructive techniques based on them abound.
Watts’ work is distinguished by her desire to specify data structures and algorithms that will not
only enumerate views but will allow them to be used in applications. Her earlier work interfaced
nicely with a graphics program that produced sample images from any given view region for
convex polyhedra.

To stand a chance of success in the violently combinatorial and geometrically complex
situation that arises with non-convex objects, Watts restricted her work to a large class of objects
that includes many every-day manufactured objects. She was able to catalog the incidence
phenomena that take place in the projective process, and use this information to design data
structures and algorithms for characterizing the aspect graph of objects in her class. The main
computational tool is "plane sweeping", which is a way to keep track of regions of 3-space as
their vertices are encountered by a plane sweeping through space.

Both the principal view work and the MRF work are of interest to us for their application to
matching structures. Previous work by Paul Cooper, supported by the NAIC, dealt with structure
matching as a recognition strategy. The work of Watts in the polyhedral domain opens the way
to making complete catalogs of opaque object structure for use by matchers such as that of
Cooper. Matching with a particular view gives information about the orientation, or pose, of the
object matched. The MRF work of Chou can lead to probabilistic and parallel strategies for
relational structure matching if the MRFs are structured non-isotropically. Thus the MRF and
principal view technologies can be brought together for the eternal vision goals of object
recognition and pose detection.


9.4  Parallel  Object  Recognition 

During 1988 Paul Cooper focused on his thesis topic, which involves the general problem
of parallel object recognition. The particular instance chosen for implementation was the
recognition of Tinker Toy objects from images.

One development was a solution to the Tinker Toy matching problem that accommodates
the geometric parameters of the object. That is, an object is recognized not just from its
topology, but also from geometric characteristics such as the lengths of the pieces and the angles
between the pieces at the junction. The key to this solution was framing the labeling problem so
that the geometry of the junctions was implicitly encoded. When framed in this manner, the
labeling problem can be solved by the application of the massively parallel constraint satisfaction
network developed earlier by Swain and Cooper [Swai88]. The application of the network to the
Tinker Toy matching problem with geometry is reported in [Coop88b].

Another development was the use of domain specific information to generate optimized
constraint satisfaction nets. Implementing the general form of Swain and Cooper’s [Swai88]
network to solve Tinker Toy matching proved infeasible due to resource requirements. But a way of
exploiting the characteristics of the Tinker Toy matching domain in order to optimize the general
network was developed, resulting in a smaller network that could be (and was) built. Later, a
way of specifying domain characteristics for arbitrary domains was discovered, allowing
optimized networks to be built, in an analogous manner, for any domain. This work is described in
two papers by Cooper and Swain [Coop88b, Coop88c].

The final and most important development was a method for matching Tinker Toys that
could incorporate inexact and uncertain information. The crux of this solution was the use of
coupled Markov Random Fields to solve, simultaneously, the segmentation and matching
problems in the Tinker Toy domain. The architecture of the solution is essentially the same as that
of previous work [Coop88a, Coop88b], in that the problem is framed as a labeling problem in the
unit/value connectionist design style. However, instead of adopting discrete constraint
satisfaction [Coop88b] or discrete connectionist relaxation [Coop88a] as the formal machinery,
Markov Random Fields (MRFs) are used. With an MRF representation, priors are combined with
hypothesis likelihoods to yield a probability distribution of solutions with rigorous Bayesian
semantics. The result is a scheme that can recognize both occluded objects and ones obscured by
noisy data.

A final report on the MRF project, as well as everything else, will be available in the thesis
[Coop89].


9.5  The  Rochester  Robot 


During the summer of 1988 a team at the University of Rochester commissioned the
Rochester Robot, consisting of a Unimate PUMA 761 arm and a three-dof, two-eyed robot head
[Brow88a]. The results of that effort are reported in the technical report, and have to some extent
appeared elsewhere [Brow88b, Ball88, Olso88]. Thus this discussion will be brief, simply listing
the accomplishments of the summer. The majority of the implementations of basic visual
capabilities mirror the capabilities discussed below in section 9.7; they are basic visual reflexes or
skills that we believe an active vision system can use to advantage as building blocks for
behavior. In every case our goal was to produce behavior in real-time.


9.5.1  Kinematics  and  Coordinate  Systems 

Brown and Rimey undertook to decipher the documentation accompanying the robot arm
and to produce a report on its semantics and syntax, and to solve some basic problems of inverse
kinematics and coordinate conversion to support elementary work with the robot. This effort is
documented in [Brow88b].


9.5.2 Communication between Sensing, Cognitive, Control, and Effecting Systems

Much software engineering and production effort was expended to produce basic interfaces
and to program the sensori-motor interfaces. These included the MaxVideo system from
Datacube, which was the province of Dave Tilley, and the communication between the host Sun
computer and the Unimate, which was the province of Tim Becker.

The parallel pipelined MaxVideo system is the heart of our real-time vision capability, and
it is a very complex machine with many dependencies of functionality, timing, etc. We are
developing programming tools, standard configurations, and abstract models that make our life
easier as users.

The robot as delivered is directly controllable only from its terminal, though communication
with other computer systems is possible. Software imported from another institution, meant to
allow control commands to be sent from the Sun, proved inadequate. Tim Becker reengineered
the entire system, and his code is the basis for our current work (though it will need to be changed
if we acquire a new VME-based controller).


9.5.3 Real-Time Adaptive Tracking

This capability mirrors the "smooth pursuit" system (see 9.7). The idea here is to extract
an image patch and use it as a real-time correlation template. There are, however, a number of
engineering problems that give the problem a little spice. In our case there were issues of
de-interlacing the video, pre-processing the input to make it more likely that the linear correlation
(implemented by MaxVideo) would be adequate, and of getting the hardware to perform as it
should.

Dave Tilley and Dana Ballard, with occasional kibitzing by Chris Brown and others,
achieved a working system that successfully tracks moving objects. A useful capability we have
not implemented is to re-acquire the correlation template in each image, thus compensating for
small systematic variations of the target’s appearance between frames as a consequence of its
motion.
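
The template idea, including the optional re-acquisition step, can be sketched in a few lines. This is a software illustration only, not the MaxVideo implementation: the function names are hypothetical, and subtracting the template mean before correlating is an assumed stand-in for the pre-processing mentioned above, whose details are not given here.

```python
def crop(frame, y, x, size):
    """Cut a size-by-size patch whose top-left corner is at row y, column x."""
    return [row[x:x + size] for row in frame[y:y + size]]

def zero_mean(patch):
    flat = [v for row in patch for v in row]
    mean = sum(flat) / len(flat)
    return [[v - mean for v in row] for row in patch]

def best_match(frame, template):
    """Return the (row, col) where linear correlation with the template peaks."""
    kernel = zero_mean(template)      # assumed pre-processing step
    size = len(kernel)
    best_score, best_pos = float("-inf"), (0, 0)
    for y in range(len(frame) - size + 1):
        for x in range(len(frame[0]) - size + 1):
            score = sum(kernel[j][i] * frame[y + j][x + i]
                        for j in range(size) for i in range(size))
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos

def track(frames, start, size, reacquire=False):
    """Follow a patch through a frame sequence, optionally refreshing the template."""
    y, x = start
    template = crop(frames[0], y, x, size)
    path = [start]
    for frame in frames[1:]:
        y, x = best_match(frame, template)
        if reacquire:
            template = crop(frame, y, x, size)   # adapt to slow appearance change
        path.append((y, x))
    return path

def make_frame(pos, n=8):
    """Synthetic test frame: flat background with a small diagonal target."""
    frame = [[10] * n for _ in range(n)]
    y, x = pos
    frame[y][x], frame[y + 1][x + 1] = 50, 50
    return frame

frames = [make_frame(p) for p in [(1, 1), (2, 3), (4, 4)]]
path = track(frames, start=(1, 1), size=2)
```

Re-acquiring the template each frame lets the tracker follow gradual changes in appearance, at the cost of possible drift if a match is slightly off.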


9.5.4  Vergence 

This capability implements a gross vergence system (see 9.7). Here vergence is based on a
global disparity calculated between subsampled left and right images. Thus it reflects large-scale
image phenomena, not high-resolution ones (in this it differs from the implementation in 9.7, but
the effect is the same). The work is reported in [Olso88].

The basic image-processing mechanism for implementing the global disparity calculation is
the cepstral filter, which is defined as the Fourier transform of the logarithm of the power
spectrum. This operation is equivalent to correlating the left and right images, using a nonlinear
operation to sharpen the correlation peaks. The computation leads to a measure of global
disparity in image x and y, which is translated into radians of rotation via a small-angle
approximation. Applying the compensating rotation verges the camera.
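
As a rough illustration of the cepstral idea, the following sketch works on a single scanline rather than a full image pair, and uses a naive DFT so that it needs only the standard library. It is not the code of [Olso88]. Placing the two signals side by side (so that the second acts as an "echo" of the first), the zero padding, and the parameters `pixel_pitch` and `focal_length` are assumptions made for the example.

```python
import cmath
import math
import random

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short signals)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * ((k * t) % n) / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out

def cepstral_shift(left, right, max_shift=8):
    """Estimate d such that right[x] is approximately left[x - d]."""
    w = len(left)
    mean_l, mean_r = sum(left) / w, sum(right) / w
    # Side by side, the right signal is an echo of the left delayed by w + d.
    pair = [v - mean_l for v in left] + [v - mean_r for v in right]
    padded = pair + [0.0] * (2 * w)          # keeps the peak and its mirror apart
    power = [abs(c) ** 2 for c in dft(padded)]
    log_power = [math.log(p + 1e-9) for p in power]
    cepstrum = [c.real for c in dft(log_power, inverse=True)]
    return max(range(-max_shift, max_shift + 1), key=lambda d: cepstrum[w + d])

def shift_to_radians(shift_px, pixel_pitch, focal_length):
    """Small-angle approximation: angle ~ tan(angle) = image offset / focal length."""
    return shift_px * pixel_pitch / focal_length

# Synthetic scanlines cut from one random texture, offset by three pixels.
rng = random.Random(0)
scene = [rng.gauss(0.0, 1.0) for _ in range(80)]
left, right = scene[8:72], scene[5:69]       # right[x] == left[x - 3]
shift = cepstral_shift(left, right)
```

The logarithm flattens the power spectrum of the image content, so the periodic ripple introduced by the echo dominates and shows up as a sharp cepstral peak at the echo delay.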

This system was implemented on the Euclid Digital Signal Processing computer in the
MaxVideo system. It incorporates several engineering niceties, and it currently processes a pair
of 32x32 subsampled images in 40 ms. There are several ideas for further work here, including
using the cepstral filter in a distributed application and also producing a coarse depth map by
local cepstral computations.


9.5.5  Kinetic  Depth 

The object of this work is to produce a depth map in real time using optic flow produced by
head motions and knowledge about those head motions. The work is reported in [Ball88]. The
idea is simply that the retinal flow of a patch of image of a static 3-D scene induced by a head
motion depends on the depth of the scene producing the image patch and upon the head motion.
It also varies with the fixation of the eyes. If the eyes fixate a patch of scene during head motion
(using either tracking or vestibular feedback, for example), optic flow is zero at that point. Thus
with fixation kinetic depth provides depth information relative to the fixation point.

The kinetic depth calculation is a combination of simple geometry and the Horn-Schunck
optic flow calculation, and results in an expression that can be implemented in a lookup table. It
involves derivatives of image intensity through time and intensity across space. In our hardware
implementation, tracking was implemented by a loop in the MaxVideo that used hardware to
report the (x,y) position of over-threshold image pixels. Derivatives were computed by
convolution on the image-processing boards, and depth was computed by a lookup table. Ideally,
the system produces full-frame depth maps at pixel resolution at frame rates. Actually the
amount of depth information depends on the presence of temporal and spatial derivatives in the
image.
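
The lookup-table structure can be sketched under simplifying assumptions that are not taken from [Ball88]: one image row, a purely lateral head translation of `head_speed` (distance per frame), no eye rotation or fixation, and a focal length `focal_px` in pixels. Under those assumptions the gradient constraint gives a flow of -It/Ix pixels per frame, and depth is focal_px * head_speed / |flow|. All names are hypothetical.

```python
def build_depth_table(focal_px, head_speed, max_grad=32):
    """Map integer (spatial, temporal) derivatives to depth; None where undefined."""
    table = {}
    for ix in range(-max_grad, max_grad + 1):
        for it in range(-max_grad, max_grad + 1):
            if ix == 0 or it == 0:
                # No spatial texture, or no measurable flow: no depth estimate.
                table[(ix, it)] = None
            else:
                flow = -it / ix                    # pixels per frame
                table[(ix, it)] = focal_px * head_speed / abs(flow)
    return table

def clamp(value, limit):
    return max(-limit, min(limit, value))

def depth_row(prev_row, cur_row, table, max_grad=32):
    """Depth estimates for the interior pixels of one image row."""
    depths = []
    for x in range(1, len(cur_row) - 1):
        ix = clamp((cur_row[x + 1] - cur_row[x - 1]) // 2, max_grad)  # spatial
        it = clamp(cur_row[x] - prev_row[x], max_grad)                # temporal
        depths.append(table[(ix, it)])
    return depths

# An intensity ramp that shifts one pixel to the right between frames.
table = build_depth_table(focal_px=500.0, head_speed=0.01)
prev_row = [4 * x for x in range(8)]
cur_row = [4 * x - 4 for x in range(8)]
depths = depth_row(prev_row, cur_row, table)
```

The `None` entries correspond to the remark above that the amount of depth information depends on the presence of temporal and spatial derivatives in the image.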


9.5.6 Vestibulo-Ocular Reflex (VOR) and Dynamic Segmentation

The VOR is a reflex that stabilizes images on the retina to compensate for head motion.
Stabilization aids low-level vision by keeping edges sharp and reducing motion blur. We
noticed that motion blur could contribute positively to image segmentation if it could be used to
blur objects that were NOT to be attended. Thus it would reduce high-frequency image
phenomena such as edges and textures that are distracting to segmentation algorithms.

We implemented the functional equivalent of the VOR using a built-in facility of the VAL-II
command language, and implemented a motion-blur amplifier in MaxVideo. The results are
gratifying: the moving head causes severe blur of scene components that are not fixated, thus
throwing the fixated objects into strong relief. Of course an object may be approximately fixated
before it is completely recognized or even segmented from an image, on the basis of partial
information about the volume of 3-space it occupies. This work was done by Chris Brown and Ray
Rimey, on the suggestion of Dana Ballard.
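
The effect can be mimicked by averaging frames over time. The design of the MaxVideo motion-blur amplifier is not described here, so plain temporal averaging is an assumed stand-in, and `temporal_blur` is a hypothetical name.

```python
def temporal_blur(frames):
    """Average a sequence of equally sized frames pixel by pixel."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]

def make_frame(t, width=6, fixated_x=2):
    """One-row frame: striped background sliding by one pixel per frame,
    plus a bright fixated object that stays at the same image position."""
    row = [100 if (x + t) % 2 == 0 else 0 for x in range(width)]
    row[fixated_x] = 200
    return [row]

blurred = temporal_blur([make_frame(t) for t in range(4)])
```

In the averaged frame the sliding background stripes collapse to a uniform gray, while the stabilized object keeps its full contrast.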


9.5.7  The  Integration  Workbench 

A flexible piece of software was written by Ray Rimey and used to integrate two binocular
stereo algorithms and the VOR-based segmentation tool, and to prove our understanding of the
robot kinematics and coordinate systems. Other capabilities have probably been added to it by now.

The code implements the behavior of first looking around the lab for black blobs in space
(these are squash balls hanging from the ceiling), and then locating them in left and right images
and calculating their 3-D locations. A catalog or map of 3-D blob positions is kept, and used to
perform operations on the balls such as touching them with a probe mounted to the head, or
fixating them while moving and running the motion blur routine on the MaxVideo hardware, thus
separating them even more from the background.

This code represents our first attempt to integrate several visual and motor capabilities in
one program, and it is remarkably smooth and fast. It is the first piece of software we have
written that dynamically reconfigures the connectivity (and thus function) of the MaxVideo system
as a task progresses.


9.5.8 Color Vision

Late in 1988 we acquired color vision capability in the form of a miniature color CCD
camera which can be mounted on the robot head. One of the first applications was a real-time color
histogrammer which was developed by Rob Potter. The color histogrammer runs on the
MaxVideo frame-rate image processing system. It takes input from an RGB color video camera, and
displays a two-dimensional projection (e.g. redness versus blueness) of the three-dimensional
color histogram. The histogrammer is useful for developing insights about the way the colors of
a scene change as the illumination changes or as the observer moves about. The histograms
themselves are expected to be useful for solving the color constancy problem and for performing
fast candidate screening in multi-object recognition systems. Future research will involve
training a neural net to determine and discount the illuminant in a scene from the data in the
color histogram.
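
A two-dimensional projection of the color histogram can be sketched as follows. The example assumes 8-bit RGB triples and takes "redness" and "blueness" to be the raw red and blue channel values; the histogrammer itself may use a different color measure, and the function name and bin count are hypothetical.

```python
def red_blue_histogram(pixels, bins=8):
    """Project the 3-D RGB histogram onto the red-blue plane.

    Summing the 3-D histogram over green is the same as ignoring the green
    value of each pixel, so the 3-D histogram never needs to be built.
    """
    hist = [[0] * bins for _ in range(bins)]
    for r, g, b in pixels:
        hist[r * bins // 256][b * bins // 256] += 1
    return hist

pixels = [(255, 0, 0), (250, 128, 10), (0, 0, 255)]
hist = red_blue_histogram(pixels, bins=4)
```

Two of the three sample pixels differ only noticeably in green, so they fall into the same red-blue bin.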
9.6 Kalman Filtering and Optimal Estimation Experiments


The EKF Utility software package is meant to support various aspects of current research,
and thus is an evolving entity. This report is meant to give a brief snapshot of our recent work in
implementing and experimenting with Kalman filter techniques. The discussion section presents
opinions on the strengths and weaknesses of optimal estimation techniques in perceptual and
control applications.

Kalman filtering is a form of optimal estimation characterized by recursive (i.e.
incremental) evaluation, an internal model of the dynamics of the system being estimated, and a
dynamic weighting of incoming evidence with ongoing expectation that produces estimates of the
state of the observed system. Neither the technical details of Kalman filtering in general nor those
of the variants we have employed so far will be recapitulated here. This report does provide prose
descriptions, references, and examples. The primary reference is [Bar88].


9.6.1  Experiments 

The basic Kalman filter is an iterative loop. Its input is the system measurements; its a
priori information is the system dynamics and noise properties of system and measurement; and
its useful outputs are the innovation (the difference between the predicted and observed
measurement, by which the filter’s performance may be quantified), and the estimated system state.

Our first application of the Kalman filter was inspired by example 2.4.1 in [Bar88]. Here, a
target (whose state is its position and velocity) undergoes one-dimensional motion at a velocity
that is constant but perturbed by white noise. Measurements yield only positional information,
perturbed by an independent measurement noise process.
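
The loop for this constant-velocity case is short enough to write out by hand. The sketch below is not the EKF Utility package: the function name, the initial covariance, and the noise levels `q` and `r` are illustrative assumptions, and the process noise uses a discrete white-noise-acceleration form chosen for the example.

```python
def kalman_constant_velocity(measurements, dt=1.0, q=0.01, r=1.0):
    """Track 1-D position and velocity from position-only measurements.

    Returns one (position, velocity, innovation, innovation_variance)
    tuple per measurement.
    """
    p, v = 0.0, 0.0                              # initial state guess
    P = [[100.0, 0.0], [0.0, 100.0]]             # large initial uncertainty
    Q = [[q * dt**4 / 4, q * dt**3 / 2],         # process noise covariance
         [q * dt**3 / 2, q * dt**2]]
    history = []
    for z in measurements:
        # Predict with F = [[1, dt], [0, 1]]: state, then covariance F P F' + Q.
        p = p + dt * v
        P00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + Q[0][0]
        P01 = P[0][1] + dt * P[1][1] + Q[0][1]
        P10 = P[1][0] + dt * P[1][1] + Q[1][0]
        P11 = P[1][1] + Q[1][1]
        # Innovation and its variance for H = [1, 0].
        nu = z - p
        S = P00 + r
        # Gain, state update, covariance update (I - K H) P.
        K0, K1 = P00 / S, P10 / S
        p, v = p + K0 * nu, v + K1 * nu
        P = [[(1 - K0) * P00, (1 - K0) * P01],
             [P10 - K1 * P00, P11 - K1 * P01]]
        history.append((p, v, nu, S))
    return history

# Noise-free positions of a target starting at 2.0 and moving 0.5 per step.
track = kalman_constant_velocity([2.0 + 0.5 * k for k in range(1, 41)])
final_position, final_velocity, final_innovation, _ = track[-1]
```

The innovation and its variance are returned alongside the state because they are what the later tests for maneuvers and clutter operate on.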

The (first order) Extended Kalman Filter (EKF) is a version of the Kalman filter that deals
with nonlinear dynamics or nonlinear measurement equations, or both. It linearizes the problem
around the predicted state (a second-order EKF makes a second-order approximation). The basic
filter control loop still applies, but measurements are predicted using the nonlinear measurement
equation h, and in the calculations for filter gain, state update, and covariance update, the
Jacobian of h is used. Likewise state prediction is accomplished using the nonlinear state
equation f and the state prediction covariance is computed using the Jacobian of f. These
generalizations call for extensions to the EKF data structure in which functions (as opposed to
matrices) are attached to the filter.
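
The role of the functions h and its Jacobian can be shown with a measurement update reduced to a scalar state. This is an illustration only: the scalar reduction, the look-down-angle geometry with an assumed observer `height`, and all names are hypothetical and do not reproduce the package's data structure or the example in [Bar88].

```python
import math

def ekf_update(x, P, z, h, h_jacobian, R):
    """First-order EKF measurement update for a scalar state and measurement."""
    H = h_jacobian(x)            # linearize h around the predicted state
    nu = z - h(x)                # innovation uses the nonlinear h itself
    S = H * P * H + R            # innovation variance
    K = P * H / S                # filter gain
    return x + K * nu, (1 - K * H) * P

height = 5.0                     # assumed observer height above the target

def look_down_angle(x):
    """Angle below the horizontal to a target at horizontal range x."""
    return math.atan2(height, x)

def look_down_jacobian(x):
    """Derivative of atan(height / x) with respect to x."""
    return -height / (x * x + height * height)

# Prior estimate of the range is 12 with variance 4; the true range is 10.
x1, P1 = ekf_update(12.0, 4.0, look_down_angle(10.0),
                    look_down_angle, look_down_jacobian, R=1e-4)
```

Passing `h` and `h_jacobian` as arguments mirrors the idea of attaching functions, rather than fixed matrices, to the filter.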

The sample application for the EKF was to track a moving target from a moving observer
(see example 3.3.1 in [Bar88]). Again the target position evolves under noisy constant-velocity
conditions. The observer travels on a course parallel to the target, but at a higher velocity, on a
platform whose position is perturbed randomly by some process. The observer measures the
look-down angle to the target in front, which gradually increases as the observer catches the
target. The scalar angular measurement itself is noisy, and is used to estimate the target’s state
(position and velocity).

The difficulty with this application is the increasing equivalent noise variance of the
measurement caused by the platform perturbations, the highly leveraged positional uncertainty of
the target due to noisy, low-angle measurements, and the non-linearity of the equations. Our
implementation demonstrates the increasingly bad performance through time that is expected as
the platform perturbations affect the measurement accuracy.

When  targets  maneuver,  i.e.  depart  from  the  basic,  steady-state,  "normal"  dynamic 
behavior,  a  tracking  filter  must  respond.  To  the  filter,  maneuvering  is  signaled  by  a  rapid  increase 
in  the  normalized  innovation.  Recommended  methods  for  dealing  with  this  situation  include  the 
following. 
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1 .  Inciea%  the  process  noise,  or  cetTiin  components  of  it,  attributed  to  the  target. 

2.  Use  several  filters  with  different  assumptions  in  parallel,  and  combine  their  outputs  proba¬ 
bilistically. 

3.  Create  new  filters  as  needed,  pumuing  a  hypothesis  tree  of  parallel  hypotheses  about  target 
state.  This  tree  must  be  pruned  rapidly  lest  its  maintenance  overwhelm  the  computational 
resources. 

4.  Model  maneuvers  as  colored  (correlated)  noise:  in  particular  model  target  acceleration  as  a 
zero-mean,  first-order  Maikov  process  (rnie  with  exponential  autocorrelation). 

5.  Perform  input  estimation,  in  which  measurements  based  on  the  nonmaneuvering  model  are 
used  to  detect  and  estimate  the  control  input  applied  to  the  plant  dynamics,  and  that  control 
input  in  turn  is  used  to  correct  the  state  estimate. 

6.  Use  variable  dimension  filtering  (VD  filtering),  in  which  the  maneuver  is  considered  part  of 
the  plant  dynamics,  not  noise.  Maneuver  detection  causes  the  substitution  of  a  different, 
higher-order dynamic model for the lower-order, "quiescent" model.

We chose to implement the VD filter, in light of its relatively low computational cost
and relatively high efficacy (as shown by Bar-Shalom). Our illustrative application was to a
target moving in two dimensions at constant velocity until some time at which it begins constant
acceleration in the same direction. The quiescent filter is simply the constant-velocity target
filter; the maneuvering filter is for a constant-acceleration target. The example we implemented
involved a target that proceeds at constant velocity for a time, then begins accelerating. The
simulation demonstrates the significant improvement in tracking that accrues from switching the
filter characteristics when the target maneuvers.
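
The detection step that triggers the model switch can be sketched as follows. This is a minimal illustration, not the report's code: it assumes a scalar measurement, a sliding-window sum of the normalized innovation squared, and a caller-supplied threshold. The class name, window length, and threshold are hypothetical, and the re-initialization of the higher-order state that a full VD filter performs is omitted.

```python
from collections import deque


class ManeuverDetector:
    """Sliding-window test on the normalized innovation squared (scalar case)."""

    def __init__(self, window, threshold):
        self.recent = deque(maxlen=window)  # last `window` normalized values
        self.threshold = threshold          # assumed to be chosen by the caller

    def update(self, innovation, innovation_variance):
        """Record one step; return True when the windowed sum exceeds the threshold."""
        self.recent.append(innovation ** 2 / innovation_variance)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and sum(self.recent) > self.threshold


def select_model(maneuvering):
    """Name of the dynamic model to run: quiescent or maneuvering."""
    return "constant-acceleration" if maneuvering else "constant-velocity"
```

A tracker would call `update` once per measurement and switch to the model named by `select_model` when the test fires.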

Tracking  an  object  in  the  presence  of  spurious  measurements  (clutter)  can  be  done  in 
several  ways.  All  assume  a  validation  gate  outside  of  which  measurements  are  ignored:  its  size  is 
a function of the desired probability of including the true measurement, and can be derived from
a  chi-squared  calculation  applied  to  the  normalized  innovation. 
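
As an illustration of the chi-squared calculation, the sketch below handles the two-dimensional case, where the chi-squared distribution with two degrees of freedom has the closed form P(d <= g) = 1 - exp(-g/2). It shows an ellipsoidal gate; the function names are hypothetical and the code is not taken from the report.

```python
import math


def gate_threshold_2d(p_gate):
    """Threshold g such that a chi-squared variate with 2 dof is <= g with probability p_gate."""
    return -2.0 * math.log(1.0 - p_gate)


def normalized_innovation_sq(nu, S):
    """nu' * inverse(S) * nu for a 2-vector nu and a 2x2 innovation covariance S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    x, y = nu
    return (d * x * x - (b + c) * x * y + a * y * y) / det


def in_gate(nu, S, p_gate=0.99):
    """True if the measurement with innovation nu falls inside the validation gate."""
    return normalized_innovation_sq(nu, S) <= gate_threshold_2d(p_gate)
```

For a gate probability of 0.99 the threshold is about 9.21, so a measurement is accepted when its normalized innovation squared is no larger than that.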

1. The optimal way to track a single target in clutter is the track-splitting approach, in which a
tree  of  possible  tracks  is  maintained.  This  can  be  a  combinatorially  expensive  method. 

2. An obvious alternative to treating all measurements, as it were, in parallel, is to pick a
single candidate measurement and proceed as if it were the right one. The obvious candidate
is the one closest in measurement space (that one with smallest normalized innovation), so
this technique is called the nearest-neighbor standard filter (NNSF). The problem is that
the true measurement can be missed.

3. A third approach is the probabilistic data association filter, or PDAF. In it, the
measurements within the validation gate are probabilistically blended to yield a combined
innovation which is input into the Kalman filtering process. The problem is that the result does
not correspond to that of any actual measurement.
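
The difference between the last two approaches can be made concrete with a scalar sketch. It is not the report's code. The NNSF part follows the description above directly; the PDAF weights use the commonly cited parametric form (Gaussian likelihood terms plus a constant clutter term), which is an assumption here, as are the default detection and gate probabilities and all function names.

```python
import math


def nnsf_innovation(innovations, s):
    """Nearest neighbor: the innovation with the smallest normalized magnitude, or None."""
    if not innovations:
        return None
    return min(innovations, key=lambda nu: nu * nu / s)


def pdaf_weights(innovations, s, clutter_density, p_detect=0.9, p_gate=0.99):
    """Return (beta0, betas): probability that none is correct, and per-measurement weights."""
    e = [math.exp(-0.5 * nu * nu / s) for nu in innovations]
    # Constant term standing for "all gated measurements are clutter" (assumed form).
    b = (clutter_density * math.sqrt(2.0 * math.pi * s)
         * (1.0 - p_detect * p_gate) / p_detect)
    total = b + sum(e)
    if total == 0.0:
        return 1.0, []
    return b / total, [ei / total for ei in e]


def pdaf_innovation(innovations, s, clutter_density):
    """Probability-weighted combined innovation fed to the Kalman update."""
    _, betas = pdaf_weights(innovations, s, clutter_density)
    return sum(beta * nu for beta, nu in zip(betas, innovations))
```

With innovations such as 0.1, -2.0, and 1.5, the NNSF commits to 0.1, while the PDAF returns a blend that coincides with none of the three, which is the drawback noted in item 3.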

General  aspects  of  the  behavior  of  the  NNSF  and  PDAF  filters  may  be  predictable  on 
abstract  grounds.  For  instance  we  might  make  the  following  predictions  for  uniformly  distributed 
clutter and high probability of detection (probability that the target is detected at all, either
inside or outside the validation gate).

1. In the presence of clutter, both the NNSF and PDAF filters are increasingly likely to "lose
track" as time goes on (e.g. start tracking clutter, or have an estimate of target state outside
some fixed bound).

2. With a "non-maneuvering" NNSF filter, low and high clutter levels may be less immediately
harmful than medium levels, since with low clutter the target is likely to be nearest the
predicted state, and with high clutter there is likely to be a clutter point near the predicted
state. It would seem that at intermediate levels the clutter would be more likely to attract
the filter away from the target.

3.  The  NNSF  filter  would  seem  more  likely  to  make  serious  errors  by  tracking  clutter  since  it 
does not weigh the evidence. The performance of the PDAF should degrade more
gracefully as conditions get worse.

We implemented the NNSF and PDAF filters, and used them to provide individual output
tracks, as in the previous work. Also the programs were embedded in Monte Carlo simulations
to provide data over a number of runs in statistically similar situations. The results confirm the
above expectations but also provide some surprises.

We  simulated  individual  tracking  runs  for  situations  with  uniformly  distributed  clutter  of 
increasing  density.  The  actual  number  of  clutter  points  in  the  validation  gate  was  determined  by 
rounding a normal variate of indicated mean (the clutter density) and standard deviation to the
nearest integer, and uniformly distributing the resulting number of clutter points throughout the
validation gate volume. In these runs the volume was close to unity.
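
The clutter generation just described can be sketched as below for a rectangular gate. This is an illustration under assumptions, not the original program: negative rounded draws are clamped to zero (the report does not say how they were handled), and the function names and the gate parameterization by center and half-widths are hypothetical.

```python
import random


def clutter_count(mean, std, rng):
    """Round a normal variate to the nearest integer; negative draws are clamped to 0 (assumption)."""
    return max(0, round(rng.gauss(mean, std)))


def clutter_points(center, half_widths, mean, std, rng):
    """Uniformly distribute the drawn number of clutter points in a rectangular gate."""
    n = clutter_count(mean, std, rng)
    return [
        tuple(c + rng.uniform(-h, h) for c, h in zip(center, half_widths))
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Gate of unit area centered on the predicted measurement, clutter density 2.0, std 0.4.
    print(clutter_points((0.0, 0.0), (0.5, 0.5), 2.0, 0.4, rng))
```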

In both cases, the basic filter is a constant-velocity (linear) Kalman filter. The plant noise is
modeled by an acceleration component which is white noise of mean 0.0 and variance q = 0.2.
The clutter density standard deviation is 0.4. The validation gate size was chosen so that 0.99 of
the target measurements should fall within it initially (as time goes on this figure seems
meaningless, since the filter may quite confidently be tracking clutter). The measurement noise has
variance 0.2 for the x state component and 0.1 for the y. The validation gates are rectangular (in
general parallelepipedal) rather than ellipsoidal. These six figures illustrate indeed that lower clutter
levels can result in worse NNSF performance than higher levels, and that performance of both
filters falls off as time increases.

This  work  perhaps  furnishes  a  mild  surprise,  viz.  the  ability  of  the  PDAF  to  reacquire  the 
track.  This  happens  for  lower  clutter  levels,  and  presumably  occurs  when  the  "signal  to  noise" 
ratio is high in the validation gate (e.g. there are few measurements in the validation gate, and at
least one of them is near the actual position of the target and correctly is accorded high weight).
The behavior of the PDAF in high clutter conditions is not as surprising: it drifts, taking the
average of the random clutter.

We  also  did  statistics  to  quantify  the  number  of  "lost  tracks",  and  the  average  final  error  of 
the estimates. The average final error function is meant to quantify the filter performance more
than the discrete "lost track" measure, illustrating a linear loss function corresponding to the
intuition that an estimate closer to the truth at the final timestep is better.
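
The two statistics can be computed over a batch of Monte Carlo runs as in the following sketch. It assumes that a track counts as lost when the final estimate lies farther than a fixed bound from the true final position, which is one of the two criteria mentioned earlier; the function name and the bound are hypothetical.

```python
import math


def summarize_runs(final_estimates, final_truths, lost_bound):
    """Return (number of lost tracks, average final error) over a set of runs."""
    errors = [math.dist(est, truth)
              for est, truth in zip(final_estimates, final_truths)]
    lost = sum(1 for err in errors if err > lost_bound)
    return lost, sum(errors) / len(errors)
```

The lost-track count is a discrete measure, while the average final error penalizes each run in proportion to its distance from the truth.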

The results are perhaps surprising in that by this definition of lost track, the NNSF
outperforms the PDAF fairly convincingly over a large range. These results are typical of many we
obtained, but it is occasionally possible to engineer situations where the PDAF loses track less
often. At least it seems fair to say that the situation is more complicated than it appears from
Figure 6.1 of [Bar88], which indicates a marked superiority of PDAF along axes whose semantics
are not clear from the text (perhaps the original paper gives more details).

The  results  are  perhaps  not  surprising  in  that  they  accord  with  our  prediction  of  graceful 
degradation of the PDAF in terms of the average error metric, using which it convincingly
outperforms the NNSF. The higher sensitivity of NNSF to intermediate clutter levels is again
demonstrated. 




9.6.2 Conclusions

Optimal estimation techniques have at least three distinct roles to play in real-time
sensorimotor systems.

1. They can be used as the basic paradigm for estimating the state of systems external to the
observer. Estimating external states is akin to perception.

2. They can be used to estimate the state of systems internal to the observer. Estimating
internal state involves aspects of proprioception (using information from internal sensors), but
can also involve sensing the outside world, especially to determine dynamic observer
parameters such as location and velocity.

3.  They  can  be  used  as  low-level  utilities  in  service  of  several  aspects  of  perception  or  action. 

The aim of our work is to examine the characteristics of optimal estimation algorithms,
including varieties of the Kalman filter, relative to the demands of a paradigm for perception and
internal state estimation. The conclusion is that algorithms of the Kalman filter style are better
matched to roles in internal state estimation than to paradigms of perception. Their performance
as technical utilities to subserve basic sensorimotor tasks must be evaluated on a case-by-case
basis.

One  important  role  of  perception  is  to  cope  with  the  unexpected.  This  seeming  truism  is 
often  ignored,  and  has  deep  implications  for  computational  perceptual  models.  In  particular  it 
implies  that  top-down  control  strategies  are  inadequate.  Top-down  (expectation-driven, 
hypothesis-verification) methods cope with the inherent underdetermined and computationally
intensive nature of perception by using a priori knowledge to constrain the space of
interpretations for perceptual data. A historic example is Shirai's polyhedral edge-following program,
which  "feels"  its  way  around  a  polyhedral  scene  efficiently  tracing  edges  it  expects,  but  ignoring 
those  arising  from  phenomena  (such  as  holes  in  faces)  not  in  its  model  of  the  polyhedral  domain. 

The opposite control strategy is bottom-up, or data-driven. Here the style is often a fixed
order of processing of input data (say by successive levels of feature detection and extraction)
leading to increasingly abstract levels of description of the input. As technology improves it is
becoming possible to achieve the massive data-processing effort in real time, and the practical
considerations that have partially motivated the top-down approach are vanishing (see, e.g.
[Brow88a]).

In one sense, the Kalman filter is an example of expectation-driven perception. In
particular, the Kalman filter by definition explicitly incorporates a model of the dynamics of the plant
producing the data to be interpreted. Also, faithful models of noise processes in the plant and
sensor are needed if the filter is to produce optimal results. The strength of the Kalman filter for
estimation is that it has these models at its disposal, but requiring them limits the sorts of
perceptual jobs that the Kalman filter can reasonably be expected to perform. The problems with a strict
top-down approach can be mitigated to some extent, and at some cost, by such measures as
running several different filters embodying different assumptions in parallel, switching between
filters when lack of fit motivates such a switch, allowing the filter to estimate control inputs to the
plant, etc., as we have seen in earlier sections.

Despite such seemingly sophisticated adaptive capabilities, the extensive literature on
Kalman filtering applications reveals that the perceptual tasks most often attempted are those in
which the plant (often the target) follows well-known and rather simple (e.g. ballistic) dynamic
laws, and in which the target is modeled as a point in space. The typical perceptual task is
tracking the target, perhaps as it maneuvers or is immersed in "clutter" (false targets). Thus the
perceptual task consists of the twofold problem of linking measurements together into tracks and
ignoring spurious data. In the tracking literature this is known as the data association problem.
The basic Kalman filter mechanism provides help in the way of quantified measures of
uncertainty, surprise, information, expectation, etc. but provides nothing directly to cope with the
familiar problems of controlling search in interpretation space.

Thus we see that in its role in the data association problem, which is known in computer
vision as the segmentation problem, the Kalman filter operates in a local and bottom-up way,
providing incremental datapoints to a higher-level algorithm responsible for grouping the points
into coherent, semantically meaningful sets. As a result the control mechanisms in perceptual
algorithms that manage the filters have a familiar character in computer vision applications. The
track-splitting, NNSF, PDAF, etc. approaches all have analogs in the edge-linking problem in computer
vision, for example.

Finally, even in the constrained perceptual tasks to which they are applied, optimal
estimation schemes need a rich flow of data. Kalman filters in the literature often take on the order of
tens of measurements to converge.


9.7  Robot  Head  and  Gaze  Control 

Behaving, actively intelligent systems, whether mechanical or biological, must manage
their computational and physical resources in appropriate ways in order to survive and to
accomplish tasks. At Rochester we are building an integrated actively intelligent system that
incorporates abstract reasoning (planning), sensing, and acting. The "active vision" paradigm we shall
exploit incorporates the following ideas.

1. A hierarchy of control, so that the highest cognitive levels can reason in terms of "what"
they want done rather than "how" to do it in detail. This hierarchy should extend
throughout the system.

2. At the lower levels, the control hierarchy ends with quasi-reflexive visual and motor
capabilities or "skills". These capabilities are cooperative but to some extent independently
controllable. Some are always running, and they form the building blocks on which more
complex behavior is built. Examples are nulling out retinal slip to minimize motion blur,
redirecting gaze as a result of attentional shifts, etc.

3. Part of the job of low-level visual capabilities is to present perceptual data, such as flow
fields or depth maps, to higher-level visual processes. These processes can often benefit
from knowledge of self-initiated motion on the part of the sensing entity. They can often be
built on the low-level control capabilities.

The work undertaken in the latter part of 1988 was directed at the control aspects of the
lower-level visual processes. In the work reported above in section 9.5, certain of these control
capabilities (e.g. tracking) were implemented and used as the foundation of real-time visual
skills. For instance, the ability to track was used in an algorithm that computed relative depth at
frame rates [Ball88]. In the work reported below, a simulation of the robot head and eyes is used
to examine the effects of different styles of interaction between certain control capabilities that
are found in primates. This report simply outlines the simulation and indicates the results of the
simplest possible interaction between the control loops: no interaction at all until their
outputs are summed at the effectors.

The simulation software is based on the robot kinematics derived over the summer
[Brow88b], and provides a flexible tool for investigating the interaction of different control
philosophies, methods, and algorithms. As of now the simulation reflects the true degree of dynamic
control we can exercise over the robot: not very much. We hope, as funding permits, to
implement more sophisticated controllers that give us more precise control over the robot. At that
point the value of a simulator will be questionable, since it may be easier to use the device
directly rather than try to simulate it at the required level of detail.




9.7.1  The  Mechanism  and  the  Imaging 

The simulated mechanism is massless; this mirrors the effective behavior of our current
hardware system when viewed from its high-level control operations. Its geometry captures all
the essentials of the head and eye system at the University of Rochester. The robot arm is not modeled;
rather the model incorporates a single head that can be positioned arbitrarily in space (six
degrees of freedom: three in position, three in orientation). The model incorporates a tilt
capability that affects both cameras, and a separate pan capability for each camera. The geometry of the
offsets of the various axes in these links is variable, and incorporates the geometrical complexity
of the real system. The independent control of the camera pans allows us to model modern
theories of saccadic and vergence systems; heads with mechanical vergence capability (such as
those at Harvard and Boston University) must use older models of these systems.

The system parameters assumed to be controllable correspond to one set of VAL robot control
parameters (corresponding to the VAL "tool coordinate system") for the head: its X,Y,Z position
and A,B,C orientation. Control over the natural pans (left and right) and tilt (common) of the
cameras is also assumed.

The  camera  models  incorporate  point  projection  with  fixed  focal  length,  as  well  as  a 
"foveal-peripheral"  distinction  by  which  the  location  of  imaged  points  is  less  certain,  outside  a 
small  foveal  region,  depending  on  the  off-axis  angle  of  the  target  being  imaged.  The  target  itself 
is a single point in 3-D space, moving under dynamical laws. The experiments below were
carried out with the target point in elliptical orbit about an invisible "black hole"; thus the target
followed an elliptical path.
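
A camera model of this kind can be sketched as follows. The point projection follows the description above; the noise model is only an assumption for illustration, with the positional standard deviation growing linearly with the off-axis angle beyond the foveal half-angle. All names and parameters are hypothetical, and points are expressed in the camera frame with z along the optical axis.

```python
import math
import random


def project(point, focal_length):
    """Point (pinhole) projection of a camera-frame point (x, y, z), z > 0."""
    x, y, z = point
    return focal_length * x / z, focal_length * y / z


def off_axis_angle(point):
    """Angle between the optical axis and the ray to the point, in radians."""
    x, y, z = point
    return math.atan2(math.hypot(x, y), z)


def position_sigma(point, fovea_half_angle, blur_gain):
    """Image-position uncertainty: zero inside the fovea, growing outside it (assumed linear)."""
    return blur_gain * max(0.0, off_axis_angle(point) - fovea_half_angle)


def observe(point, focal_length, fovea_half_angle, blur_gain, rng):
    """Noisy image location of the target as seen by one camera."""
    u, v = project(point, focal_length)
    sigma = position_sigma(point, fovea_half_angle, blur_gain)
    if sigma == 0.0:
        return u, v
    return u + rng.gauss(0.0, sigma), v + rng.gauss(0.0, sigma)
```

A foveated target is imaged exactly, while a peripheral target is imaged with increasing positional error, which is what makes saccades to peripheral targets inaccurate.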

It is assumed that the imaging system knows the distance to the target (in real life, this
distance may be derived from binocular stereo, a priori knowledge, any of a number of monocular
distance cues, kinetic depth calculations, etc.). It is assumed that, for each eye, the instantaneous
retinal velocity of the target is known (i.e. the vector difference between its position in the current
image and its position in the last image). Other than that, the system only knows the left and
right image (x,y) location of the target's image. Of course the target's image position and hence
image velocity is perturbed by uncertainties arising from the blurriness of peripheral vision,
should the target not be foveated.


9.7.2  The  Control  Loops 

The basic control loops that manage the system are inspired by the primate visual system.
They  include  the  following  capabilities.  Although  not  described  in  detail  here,  much  is  known 
about  these  control  systems  in  animals  and,  to  some  extent,  in  man.  Two  good  references  are 
[BeitSS,  PeteSS]. 

It  is  worth  saying  that  the  implementation  of  each  of  these  control  loops  requires  only  a 
page  of  C  code  or  so,  and  that  at  this  stage  most  assumptions  and  technical  decisions  have  been 
for  the  sake  of  simplicity  rather  than  for  the  sake  of  faithfully  modeling  known  biological  systems 
or  optimal  mechanical  systems.  One  of  the  major  design  goals,  however,  is  that  the  system  have 
enough  richness  to  incorporate  more  detailed  control  models. 

Most of the loops have several parameters, such as the proportional, integral, and derivative
constants of their controllers, and their delays and latencies. Delay here means the amount of
time after a commanded motion before it commences. This is often called latency in the
literature. Latency here refers to the time required to execute a command; it is a time constant that
indicates both how soon another command can be accepted and how long the command will be
affecting the controlled (velocity) variables. In the robot system the delay corresponds to how
long it takes the mechanical system to respond to a motion ordered from a high software level,
and the latency reflects how long it takes to complete a command. I assume that "sensors"
(actually robot and eye-control motor states read from their controllers) are available to the
system immediately, without delay, and thus reflect the "true" state of the world.

There  are  five  separate  control  systems. 

1.  Saccade: 

This is a fast slewing of cameras to point in a commanded direction, during which visual
processing is usually considered turned off. The command to the eyes is modeled as open
loop, though there are such things as "secondary" saccades that correct errors in initial
saccades. The saccadic system tries to foveate the target and to match eye rotations to the
target velocity so as to be tracking the target as soon as the saccade is completed. Current
opinion is that the saccadic system is aware of the 3-D location of the target, not just the
location of its retinal image. In the implementation used for the experiments below, it
operates with retinal locations and velocities, not 3-D locations or distance. The left eye is
dominant in the system. The saccade aims to center the target image on the fovea of the left
eye; the right eye is panned by the same amount (and of course tilted by the same amount
for mechanical reasons). Thus the saccade maintains the current vergence angle. It is
implemented as a constant-speed slewing of all three pan and tilt axes, with one of them
attaining a system constant maximum velocity. The slewing continues until the target
should be foveated (it may not be, due to peripheral blurring), at which time the system is
left with eye velocities that match the perceived target motion before the saccade. The
saccadic system is characterized by its maximum velocity and its delay.

2.  Smooth  Pursuit: 

The eyes track a moving target, using retinal slip as a control input. The error here is target
position in each eye (which should be (0,0)), and the commands (here sent to both eyes
independently: neither eye is dominant) are pan and tilt velocities. The pursuit system has
delay, latency, and PID control.

3.  Vergence: 

The vergence system measures disparity between the target position in the left and right
eyes, and pans the right eye to reduce it. The vergence system has delay, latency, and PID
control.

4. Vestibulo-Ocular System:

The VOR system is open loop in the sense that its inputs come from the head positioning
system and its outputs go to the eye positioning system. Its purpose is to stabilize eyes
against head motion, and its inputs are thus derivatives of head position (XYZ velocities,
ABC angular velocities). It also uses the distance of the target, since the appropriate
response to a close target is different from that to a far target. Actually the VOR should be
implemented by inverse kinematics, to which the current implementation (and presumably
the neural one) is an approximation. Its output is commands to the pans and tilt controls to
null out the apparent target motion caused by head motion. It is characterized by delay,
latency, and open loop proportional gain.

5.  Platform  Compensation: 

This system is a head-control, not gaze-control system. These systems are known to
interact in subtle and complex ways, but this particular reflex simply attempts to keep the
eyes "centered in the head", so that the camera pans or tilts are kept within "comfortable"
mechanical ranges. The "comfort function" is a nonlinear one, x/((x - xmax)^2), where x is
either the average pan angle or the tilt angle, and in either case xmax is the physically
imposed limit of the system. This reflex is open loop (eye position affects head position),
with delay, latency, and open loop proportional gain.
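
Two of the ingredients named in the list above, PID control and the comfort function, can be sketched in a few lines. This is an illustration and not the simulation's C code: the discrete PID form, the sign conventions, and all names are assumptions, and the comfort function is transcribed directly from the formula given in item 5.

```python
class PID:
    """Discrete proportional-integral-derivative controller with a fixed time step."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.previous_error = None

    def step(self, error):
        """Return the control output (e.g. a pan velocity) for the current error."""
        self.integral += error * self.dt
        if self.previous_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.previous_error) / self.dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def comfort(x, xmax):
    """Comfort function x / (x - xmax)^2: grows rapidly as x approaches the limit xmax."""
    return x / (x - xmax) ** 2
```

For smooth pursuit, one such controller per axis would take the target's image coordinate as the error and return a velocity command; the platform compensation reflex would scale the comfort value by its open loop gain to produce a head velocity.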

The system operates in two modes: smooth pursuit and saccade. During saccades, the
vergence and saccade reflexes are running. While this particular disparity-driven implementation of
vergence is presumably not implemented by primates, they do have the "near triad" reflex which
includes both vergence and pupil diameter reduction in saccades between near and far targets.
My inclusion of the vergence during saccades is a way to implement 2/3 of the near triad. In
smooth pursuit mode, the VOR, platform compensation, pursuit, and vergence systems are
running.

The delays and latencies are implemented with a command pipeline, in which the
commanded changes in velocities are entered opposite the time in the future they are to take effect.
Time is discretized to some level. Delays result in later entry of commands. Latencies are
implemented by dividing the commanded change between as many discrete time periods as
necessary to spread the effect over the latency. The pipeline thus is indexed by (future) time
instant and it has entries that hold the commanded velocities for the six head degrees of freedom
and three camera degrees of freedom. Each instant also has an entry corresponding to its mode
(saccadic or pursuit). The pipeline has finite length and is in fact implemented as a ring buffer.

For the experiments carried out below, the combination of control effects takes place when
a reflex (say VOR) increments or decrements a velocity term in the pipeline. The increment or
decrement is made to the current value, which may be nonzero because of input from another
reflex (say tracking). Thus the control commands are summed in the simplest possible way, as if
each control system's output was a D.C. voltage and all the outputs were soldered together at the
effector motor's input.
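
The command pipeline and the summation of reflex outputs can be sketched as a small ring buffer. This is a simplified illustration rather than the simulation's code: it stores velocity changes per time slot, omits the per-slot mode entry and the saccadic overwrite, and uses hypothetical names throughout.

```python
class CommandPipeline:
    """Ring buffer of future velocity changes, one slot per discrete time step."""

    def __init__(self, length, n_axes):
        self.slots = [[0.0] * n_axes for _ in range(length)]
        self.now = 0  # index of the slot for the current time step

    def add(self, axis, delta_v, delay, latency):
        """Schedule a change starting `delay` steps ahead, spread over `latency` steps."""
        steps = max(1, latency)
        if delay + steps > len(self.slots):
            raise ValueError("command extends past the end of the pipeline")
        for k in range(steps):
            slot = self.slots[(self.now + delay + k) % len(self.slots)]
            slot[axis] += delta_v / steps  # commands from different reflexes simply sum

    def tick(self):
        """Return the changes due now, clear that slot, and advance one time step."""
        due = self.slots[self.now]
        self.slots[self.now] = [0.0] * len(due)
        self.now = (self.now + 1) % len(self.slots)
        return due
```

For example, if one reflex schedules a change of 1.0 on an axis with delay 2 and latency 2, and another schedules 0.5 on the same axis with delay 2 and latency 1, the axis receives 1.0 at the third step and 0.5 at the fourth.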

The saccadic system shuts down the pursuit system in the sense that for the duration of the
saccade (which is computed from the image distance it must move the fovea and the maximum
velocity it can move), all other commands in the pipeline are overwritten, and the mode is
changed to "saccade". Further commands trying to affect these instants are ignored. If the
system is in pursuit mode, command velocities are summed as mentioned in the previous paragraph.
The system is diagramed in the flow chart shown in figure 9.7.1.


9.7.3 An Experiment on Control Interactions

The following sequence of graphs illustrates the cumulative effect of adding control
capabilities in the manner outlined above: they operate independently and their outputs are simply
summed at the effectors. For this run the delays and latencies were both set to constant low
values. Large delays would have a very serious effect on the performance of the system, and are a
serious issue. In the work at Harvard this is addressed by positive feedback, which reduces the
delay effects (and is also good for putting actions into "head frame" instead of "retinal frame").

In each case, the system is tracking (or acquiring and tracking) a target moving around in an
ellipse whose plane is parallel to the image plane. Each graph plots tracking error in x and y as a
function of time for the left and right eye. Since the goal is to foveate the target at (0,0), the
graphs also show the image coordinates of the target through time. There are actually four graphs
per figure, but often the left and right y errors are identical because both cameras have the same
tilt. Basically, figures 9.7.2 - 9.7.7 illustrate the increasingly effective final behavior of the
system as reflexes are added.


9.7.4  Learning  and  Adaptation 

The  simulated  system  can  support  other  relevant  aspects  of  the  control  problem,  including 
the  important  orre  of  adt^ting  to  changes  in  the  plant.  The  one  application  here  that  1  have  imple¬ 
mented  involves  using  "the  MIT  rule",  which  is  a  gradient  descent  method  similar  to  back- 
propagation  learning  in  neural  nets,  to  learn  part  of  the  robot  head  geometry.  In  a  way  this  sys¬ 
tem  acts  like  another  control  system,  which  inputs  the  discrepancies  between  expected  and 
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Figure  9.7.1:  The  five  control  loops  in  the  robot  head  simulation. 
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Figure  9.1  J.:  Tracking  only 

Cameras  are  assuming  mechanically  impossible  pans  and  tilts  for  which  I  am  not  checking  expli¬ 
citly.  If  I  were  modeling  maximum  exoirsion.  tracking  error  would  at  some  point  rapidly 
increase  as  they  cameras  hit  their  stops. 


Figure  9.7  J:  Head  compensation  added 

To  keep  the  eyes  centered  in  the  head,  this  reflex  moves  flie  head  in  the  same  direction  the  eyes 
are  moving.  Unfortunately  in  this  case  such  a  move  amplifies  the  tracking  corrections,  overcom¬ 
pensates,  and  renders  the  system  unstable. 
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VOR’s  function  is  to  compensate  for  head  motion.  In  this  case  the  head  motion  was  engendered 
by  the  head  compensation  reflex,  which  itself  was  driven  by  the  tracking  motions  of  the  eyes. 
VOR  effectively  stabilizes  the  system,  which  may  now  be  imagined  as  moving  both  its  eyes  and 
head  to  track  the  object. 


Here the system makes an initial saccade to place the target on the fovea and match its velocity.
The left eye dominance and the implementation that preserves vergence during saccades (the right
eye gets the same pan velocity signal) mean that the right eye is slewed off target as the left eye
is slewed on target.

For this experiment the effects of peripheral blurring were increased, so the saccade is actually
made  to  an  inaccurate  location. 


Figure  9.7.7: 

Here,  conditions  were  as  for  Fig.  9.5.6  but  the  vergence  system  was  turned  on.  It  tries  to  drive 
the target image's disparity between the left and right eyes to zero, and results in improved
performance.

observed target motions given eye motions, and whose outputs are parameters to the modeled plant
(in this case, lengths of links in the head kinematic chain).
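
The adaptation scheme described in Section 9.7.4 can be illustrated with a minimal sketch. It is not the simulator's code: it assumes a deliberately simplified, hypothetical geometry (one camera mounted a distance `link` from a single pan axis, one target straight ahead at a known `depth`) and a discrete-time form of the MIT rule, in which the parameter estimate is moved against the product of the prediction error and the sensitivity of the prediction to that parameter. All function names, the gain, and the pan sequence are assumptions made for the example.

```python
import math

def predicted_angle(link, pan, depth):
    """Bearing (rad) of a target straight ahead at `depth`, as seen by a
    camera mounted `link` from the pan axis after panning by `pan`."""
    cam_x = link * math.sin(pan)
    cam_z = link * math.cos(pan)
    return math.atan2(-cam_x, depth - cam_z) - pan

def sensitivity(link, pan, depth):
    """Partial derivative of predicted_angle with respect to `link`."""
    cam_x = link * math.sin(pan)
    cam_z = link * math.cos(pan)
    r2 = cam_x ** 2 + (depth - cam_z) ** 2
    return -depth * math.sin(pan) / r2

def learn_link(true_link, initial_link, depth=1.0, gain=2.0, steps=200):
    """Discrete MIT-rule update: estimate -= gain * error * d(error)/d(estimate)."""
    pans = [-0.5, -0.3, 0.2, 0.4]  # assumed repeating sequence of eye motions
    estimate = initial_link
    for k in range(steps):
        pan = pans[k % len(pans)]
        # The "observation" is generated from the true plant (noise-free here).
        observed = predicted_angle(true_link, pan, depth)
        error = predicted_angle(estimate, pan, depth) - observed
        estimate -= gain * error * sensitivity(estimate, pan, depth)
    return estimate

if __name__ == "__main__":
    print(learn_link(true_link=0.10, initial_link=0.05))
```

In this toy setting the estimate settles on the true link length because the error is monotonic in the single unknown parameter; with several coupled link lengths or noisy observations, the gain would need more care.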


9.8  Visual  Navigation 

In fall 1988 Randal Nelson joined the faculty and the vision group at Rochester, having
completed his PhD at the University of Maryland. The dissertation research involved the
description and implementation of a set of foundational abilities for visual navigation. In particular,
visual methods were described for performing passive navigation, obstacle avoidance, and
homing in general, real-world environments [Nels88d]. This work fits nicely within the framework of
active vision which the group is currently pursuing and, since he intends to continue work along
similar lines, a summary of the dissertation results follows.

Passive navigation is a process by which a system obtains information about its rotational and
translational motion. This information is useful in navigation to stabilize and direct the motion of
the system. Visual methods attempt to obtain the motion parameters from a time-series of images
[Gibs50, Praz80, Horn81, Hild83, Lawt83, Long84, Koen86]. The problem is hard because
solution methods tend to be extremely sensitive to small errors in the input, while accurate image
flow or point correspondence information is difficult to obtain [Tsai81, Adiv85, Anan85, Nage86,
Verr87]. The dissertation shows how accurate motion parameters can be obtained from
inaccurate flow data by utilizing image information over an entire visual sphere [Nels88a]. Essentially,
global topological constraints are used to stabilize the process. It is interesting to note that such
spherical images are available to flying insects such as bees and dragonflies, so there is a
biological precedent.

Obstacle avoidance refers to the ability of a system to move about in the environment
without striking the objects in it. This is a fundamental navigational behavior. It is shown that
computation of divergence-like properties of the visual flow field provides qualitative cues which
are invariant under rotation of the system and which are sufficient to permit the system to avoid
collisions. The method is applicable in general environments, the only requirement being the
presence of sufficient visual texture to allow the image flow to be roughly approximated.
Empirical measurements show that sufficient texture is present in ordinary objects such as stones, trees,
and faces. The method was implemented and used successfully to control the motion of a camera
in various environments.
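
As a rough illustration of why a divergence measure is a useful collision cue, the sketch below uses a much simpler setting than the dissertation's: a camera approaching a frontal surface head-on, where the image flow is a pure expansion whose divergence equals 2 divided by the time to contact. It also checks one special case of the rotational invariance mentioned above, namely that a roll about the optical axis adds a flow component with zero divergence. The flow model, the function names, and the finite-difference step are assumptions for the example; the divergence-like measures of [Nels88d] are more general and are not reproduced here.

```python
def make_flow(tau, roll):
    """Return a flow field u(x, y), v(x, y): expansion with time to
    contact `tau` plus a roll about the optical axis at rate `roll`."""
    def flow(x, y):
        u = x / tau - roll * y
        v = y / tau + roll * x
        return u, v
    return flow

def divergence(flow, x, y, h=1e-3):
    """Central-difference estimate of du/dx + dv/dy at image point (x, y)."""
    du_dx = (flow(x + h, y)[0] - flow(x - h, y)[0]) / (2 * h)
    dv_dy = (flow(x, y + h)[1] - flow(x, y - h)[1]) / (2 * h)
    return du_dx + dv_dy

def time_to_contact(flow, x, y):
    """For the pure-expansion model, divergence = 2 / tau."""
    div = divergence(flow, x, y)
    return 2.0 / div if div > 0 else float("inf")

if __name__ == "__main__":
    still = make_flow(tau=4.0, roll=0.0)
    rolling = make_flow(tau=4.0, roll=0.7)
    # Both lines report (approximately) the same divergence and time to contact.
    print(divergence(still, 0.2, -0.1), time_to_contact(still, 0.2, -0.1))
    print(divergence(rolling, 0.2, -0.1), time_to_contact(rolling, 0.2, -0.1))
```

A controller built on this cue would only need to compare the estimated time to contact against a threshold, which is consistent with the claim that a rough approximation of the flow is sufficient.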

Homing is the process by which an autonomous system guides itself to a particular location
on the basis of sensory input. This is a slightly more sophisticated, but still fundamental
navigational ability. In the dissertation, a method of visual homing using an associative memory
[Hint81, Ackl85, Rume86, Smol86] based on a simple pattern classifier is described [Nels88c].
Homing is accomplished without the use of an explicit world model by utilizing direct
associations between learned visual patterns and system motor commands. The technique is analyzed in
terms of a pattern space, and conditions are obtained which allow the system performance to be
predicted on the basis of statistical measurements on the environment. The method was
implemented and used to guide a robot-mounted camera in a three-dimensional environment.
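
The idea of homing through direct associations between patterns and motor commands, with no world model, can be sketched as follows. This is an illustrative toy and not the method of [Nels88c]: a nearest-neighbor lookup stands in for the pattern classifier, the "visual pattern" is a hypothetical two-number view of a one-dimensional corridor, and the motor command is a single step of -1, 0, or +1. The class, the functions, and the training positions are all assumptions made for the example.

```python
def distance(a, b):
    """Squared Euclidean distance between two patterns."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

class AssociativeMemory:
    """Stores (pattern, command) pairs; recalls the command of the nearest pattern."""
    def __init__(self):
        self.pairs = []

    def learn(self, pattern, command):
        self.pairs.append((tuple(pattern), command))

    def recall(self, pattern):
        return min(self.pairs, key=lambda pair: distance(pair[0], pattern))[1]

def view(position, length=8):
    """Hypothetical sensor: distances to the two ends of a corridor."""
    return (position, length - position)

def train(home_position=4, samples=(1, 3, 4, 5, 7)):
    """Associate the view at each sample position with a step toward home."""
    memory = AssociativeMemory()
    for p in samples:
        step = (home_position > p) - (home_position < p)  # -1, 0, or +1
        memory.learn(view(p), step)
    return memory

def go_home(memory, start, max_steps=20):
    """Repeatedly observe, recall a command, and move; stop on a zero step."""
    position = start
    path = [position]
    for _ in range(max_steps):
        step = memory.recall(view(position))
        if step == 0:
            break
        position += step
        path.append(position)
    return path

if __name__ == "__main__":
    memory = train()
    print(go_home(memory, start=0))
    print(go_home(memory, start=8))
```

Positions 0, 2, 6, and 8 are never shown during training, yet the system still reaches home from them because their views fall nearest to stored patterns that carry the correct command. How well such generalization works depends on how patterns are distributed in pattern space, which is the kind of property the statistical analysis described above is concerned with.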


